Objective: Explore the dataset to identify differences between the customers of each product.
Come up with a customer profile (characteristics of a customer) of the different products. Perform univariate and multivariate analyses. Generate a set of insights and recommendations that will help the company in targeting new customers.
The data is about customers of the treadmill product(s) of a retail store called Cardio Good Fitness. It contains the following variables-
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Command to tell Python to actually display the graphs
%matplotlib inline
#open-source Python graphing library for building beautiful, interactive visualizations, and showing exact numbers
!pip install plotly
import plotly.express as px
Requirement already satisfied: plotly in c:\users\munee\anaconda3\lib\site-packages (5.11.0) Requirement already satisfied: tenacity>=6.2.0 in c:\users\munee\anaconda3\lib\site-packages (from plotly) (8.1.0)
#Mount Google drive
#from google.colab import drive
#drive.mount('/content/drive')
# read the dataset
df1 = pd.read_csv("CardioGoodFitness.csv") # please replace '/content/tips.csv' with the location of the dataset in your drive
#copy dataset to not override the original dataset
df = df1.copy()
The initial steps to get an overview of the dataset is to:
df.head()
| Product | Age | Gender | Education | MaritalStatus | Usage | Fitness | Income | Miles | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | TM195 | 18 | Male | 14 | Single | 3 | 4 | 29562 | 112 |
| 1 | TM195 | 19 | Male | 15 | Single | 2 | 3 | 31836 | 75 |
| 2 | TM195 | 19 | Female | 14 | Partnered | 4 | 3 | 30699 | 66 |
| 3 | TM195 | 19 | Male | 12 | Single | 3 | 3 | 32973 | 85 |
| 4 | TM195 | 20 | Male | 13 | Partnered | 4 | 2 | 35247 | 47 |
df.shape
(180, 9)
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 180 entries, 0 to 179 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Product 180 non-null object 1 Age 180 non-null int64 2 Gender 180 non-null object 3 Education 180 non-null int64 4 MaritalStatus 180 non-null object 5 Usage 180 non-null int64 6 Fitness 180 non-null int64 7 Income 180 non-null int64 8 Miles 180 non-null int64 dtypes: int64(6), object(3) memory usage: 12.8+ KB
Age, Education, Usage, Fitness, Income, and Miles columns are numerical columns while the rest are categorical in nature (Gender, MaritalStatus , and Product are 'object'). There are no missing values here.The primary difference between categorical and object type columns is that categorical columns help to improve the processing speed and reduces the storage space when we are working with a large dataset.
df.describe()
| Age | Education | Usage | Fitness | Income | Miles | |
|---|---|---|---|---|---|---|
| count | 180.000000 | 180.000000 | 180.000000 | 180.000000 | 180.000000 | 180.000000 |
| mean | 28.788889 | 15.572222 | 3.455556 | 3.311111 | 53719.577778 | 103.194444 |
| std | 6.943498 | 1.617055 | 1.084797 | 0.958869 | 16506.684226 | 51.863605 |
| min | 18.000000 | 12.000000 | 2.000000 | 1.000000 | 29562.000000 | 21.000000 |
| 25% | 24.000000 | 14.000000 | 3.000000 | 3.000000 | 44058.750000 | 66.000000 |
| 50% | 26.000000 | 16.000000 | 3.000000 | 3.000000 | 50596.500000 | 94.000000 |
| 75% | 33.000000 | 16.000000 | 4.000000 | 4.000000 | 58668.000000 | 114.750000 |
| max | 50.000000 | 21.000000 | 7.000000 | 5.000000 | 104581.000000 | 360.000000 |
df['Product'].unique()
array(['TM195', 'TM498', 'TM798'], dtype=object)
df.describe(include='all').T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Product | 180 | 3 | TM195 | 80 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Age | 180.0 | NaN | NaN | NaN | 28.788889 | 6.943498 | 18.0 | 24.0 | 26.0 | 33.0 | 50.0 |
| Gender | 180 | 2 | Male | 104 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Education | 180.0 | NaN | NaN | NaN | 15.572222 | 1.617055 | 12.0 | 14.0 | 16.0 | 16.0 | 21.0 |
| MaritalStatus | 180 | 2 | Partnered | 107 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Usage | 180.0 | NaN | NaN | NaN | 3.455556 | 1.084797 | 2.0 | 3.0 | 3.0 | 4.0 | 7.0 |
| Fitness | 180.0 | NaN | NaN | NaN | 3.311111 | 0.958869 | 1.0 | 3.0 | 3.0 | 4.0 | 5.0 |
| Income | 180.0 | NaN | NaN | NaN | 53719.577778 | 16506.684226 | 29562.0 | 44058.75 | 50596.5 | 58668.0 | 104581.0 |
| Miles | 180.0 | NaN | NaN | NaN | 103.194444 | 51.863605 | 21.0 | 66.0 | 94.0 | 114.75 | 360.0 |
df.duplicated().sum()
0
df.isna()
| Product | Age | Gender | Education | MaritalStatus | Usage | Fitness | Income | Miles | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 175 | False | False | False | False | False | False | False | False | False |
| 176 | False | False | False | False | False | False | False | False | False |
| 177 | False | False | False | False | False | False | False | False | False |
| 178 | False | False | False | False | False | False | False | False | False |
| 179 | False | False | False | False | False | False | False | False | False |
180 rows × 9 columns
#check each column for missing values
print(df.isnull().sum())
Product 0 Age 0 Gender 0 Education 0 MaritalStatus 0 Usage 0 Fitness 0 Income 0 Miles 0 dtype: int64
There are no missing values.
Let's check the distribution for numerical columns.
Observations on Age
sns.histplot(data=df,x='Age',stat='density')
plt.show()
sns.boxplot(data=df,x='Age')
plt.show()
fig1 = px.box(df, x="Age", points="all")
fig1.show()
Observations on Education
sns.histplot(data=df,x='Education',stat='density')
plt.show()
sns.displot(data=df,x='Education',kind='kde')
plt.show()
sns.boxplot(data=df,x='Education')
plt.show()
fig2 = px.box(df, x="Education", points="all")
fig2.show()
Observations on Usage
sns.histplot(data=df,x='Usage',stat='density')
plt.show()
sns.displot(data=df,x='Usage',kind='kde')
plt.show()
sns.boxplot(data=df,x='Usage')
plt.show()
fig3 = px.box(df, x="Usage", points="all")
fig3.show()
Observations on Fitness
sns.histplot(data=df,x='Fitness',stat='density')
plt.show()
sns.displot(data=df,x='Fitness',kind='kde')
plt.show()
sns.boxplot(data=df,x='Fitness')
plt.show()
fig4 = px.box(df, x="Fitness", points="all")
fig4.show()
Observations on Income
sns.histplot(data=df,x='Income',stat='density')
plt.show()
sns.displot(data=df,x='Income',kind='kde')
plt.show()
sns.boxplot(data=df,x='Income')
plt.show()
fig5 = px.box(df, x="Income", points="all")
fig5.show()
Observations on Miles
sns.histplot(data=df,x='Miles',stat='density')
plt.show()
sns.displot(data=df,x='Miles',kind='kde')
plt.show()
sns.boxplot(data=df,x='Miles')
plt.show()
fig6 = px.box(df, x="Miles", points="all")
fig6.show()
Observations on Product
sns.countplot(data=df,x='Product')
plt.xticks(rotation=90)
plt.show()
Observations on Gender
sns.countplot(data=df,x='Gender')
plt.xticks(rotation=90)
plt.show()
Observations on Marital Status
sns.countplot(data=df,x='MaritalStatus')
plt.xticks(rotation=90)
plt.show()
# Check for correlation among numerical variables
num_var = ['Age', 'Education', 'Usage', 'Fitness', 'Income', 'Miles']
corr = df[num_var].corr()
# plot the heatmap
plt.figure(figsize=(15, 7))
sns.heatmap(corr, annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
sns.pairplot(data=df, diag_kind="kde")
plt.show();
sns.pairplot(data=df.drop(["Fitness"],axis=1),hue="Product", diag_kind="kde")
plt.show()
fig = px.box(df, x="Product", y="Age", points="all")
fig.show()
fig = px.box(df, x="Product", y="Miles", points="all")
fig.show()
sns.catplot(x='Product', y='Miles', data=df, kind="bar", hue='Fitness')
plt.xticks(rotation=90)
plt.show()
sns.catplot(x='Product', y='Income', data=df, kind="bar", hue='Education')
plt.xticks(rotation=90)
plt.show()
sns.catplot(x='Product', y='Miles', data=df, kind="bar", hue='Fitness')
plt.xticks(rotation=90)
plt.show()
sns.catplot(x='Product', y='Income', data=df, kind="bar", hue='Gender')
plt.xticks(rotation=90)
plt.show()
sns.catplot(x='Product', y='Education', data=df, kind="bar", hue='Gender')
plt.xticks(rotation=90)
plt.show()
# Check if the trend is among usage of the product and age
df.groupby(by = ['Product','Usage'])['Age'].mean()
Product Usage
TM195 2 27.789474
3 29.189189
4 28.681818
5 22.500000
TM498 2 29.428571
3 28.967742
4 28.750000
5 26.333333
TM798 3 22.000000
4 28.222222
5 30.833333
6 29.571429
7 28.500000
Name: Age, dtype: float64
plt.figure(figsize=(15,7))
sns.lineplot(data=df, x="Age", y="Miles", hue='Gender', estimator='mean', ci=False)
plt.ylabel('Miles')
plt.xlabel('Age')
plt.show()
plt.figure(figsize=(15,7))
sns.lineplot(data=df, x="Age", y='Income', hue='Product', estimator='mean', ci=False)
plt.ylabel('Income')
plt.xlabel('Age')
plt.show()
sns.catplot(x='Product', y='Age', data=df, kind="bar", hue='Usage')
plt.xticks(rotation=90)
plt.show()
No missing values indicated here.
# data.shape[0] will give us the number of rows in the dataset
# selecting the instances where missing value is greater than 0
pd.DataFrame({'Count':df.isnull().sum()[df.isnull().sum()>0],'Percentage':(df.isnull().sum()[df.isnull().sum()>0]/df.shape[0])*100})
| Count | Percentage |
|---|
df.head()
| Product | Age | Gender | Education | MaritalStatus | Usage | Fitness | Income | Miles | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | TM195 | 18 | Male | 14 | Single | 3 | 4 | 29562 | 112 |
| 1 | TM195 | 19 | Male | 15 | Single | 2 | 3 | 31836 | 75 |
| 2 | TM195 | 19 | Female | 14 | Partnered | 4 | 3 | 30699 | 66 |
| 3 | TM195 | 19 | Male | 12 | Single | 3 | 3 | 32973 | 85 |
| 4 | TM195 | 20 | Male | 13 | Partnered | 4 | 2 | 35247 | 47 |
numeric_columns = ['Age', 'Education','Usage', 'Fitness', 'Income', 'Miles']
plt.figure(figsize=(15, 12))
for i, variable in enumerate(numeric_columns):
plt.subplot(4, 4, i + 1)
plt.boxplot(df[variable], whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
Q1 = df[numeric_columns].quantile(0.25)
Q3 = df[numeric_columns].quantile(0.75)
IQR = Q3 - Q1 #Inter Quantile Range (75th percentile - 25th percentile)
lower_whisker = Q1 - 1.5*IQR #Finding lower and upper bounds for all values. All values outside these bounds are outliers
upper_whisker = Q3 + 1.5*IQR
# Percentage of outliers in each column
((df[numeric_columns] < lower_whisker) | (df[numeric_columns] > upper_whisker)).sum()/df.shape[0]*100
Age 2.777778 Education 2.222222 Usage 5.000000 Fitness 1.111111 Income 10.555556 Miles 7.222222 dtype: float64
Treating outliers
I will cap/clip the minimum and maximum value of these columns to the lower and upper whisker value of the boxplot found using Q1 - 1.5*IQR and Q3 + 1.5*IQR, respectively. This is because some of the variables contain few variables that may alter the data results.
Note: Generally, a value of 1.5 * IQR is taken to cap the values of outliers to upper and lower whiskers but any number (example 0.5, 2, 3, etc) other than 1.5 can be chosen. The value depends upon the business problem statement.
def treat_outliers(df, col):
"""
treats outliers in a variable
col: str, name of the numerical variable
df: dataframe
col: name of the column
"""
Q1 = df[col].quantile(0.25) # 25th quantile
Q3 = df[col].quantile(0.75) # 75th quantile
IQR = Q3 - Q1 # Inter Quantile Range (75th perentile - 25th percentile)
lower_whisker = Q1 - 1.5 * IQR
upper_whisker = Q3 + 1.5 * IQR
# all the values smaller than lower_whisker will be assigned the value of lower_whisker
# all the values greater than upper_whisker will be assigned the value of upper_whisker
# the assignment will be done by using the clip function of NumPy
df[col] = np.clip(df[col], lower_whisker, upper_whisker)
return df
df = treat_outliers(df,'Age')
# visualizing the column after outlier treatment
sns.boxplot(data=df,x='Age')
plt.show()
fig = px.box(df, x="Age", points="all")
fig.show()
df = treat_outliers(df,'Income')
# visualizing the column after outlier treatment
sns.boxplot(data=df,x='Income')
plt.show()
fig = px.box(df, x="Income", points="all")
fig.show()
df = treat_outliers(df,'Miles')
# visualizing the column after outlier treatment
sns.boxplot(data=df,x='Miles')
plt.show()
fig = px.box(df, x="Miles", points="all")
fig.show()
df = treat_outliers(df,'Education')
# visualizing the column after outlier treatment
sns.boxplot(data=df,x='Education')
plt.show()
fig = px.box(df, x="Education", points="all")
fig.show()
df = treat_outliers(df,'Usage')
# visualizing the column after outlier treatment
sns.boxplot(data=df,x='Usage')
plt.show()
fig = px.box(df, x="Usage", points="all")
fig.show()
numeric_columns = ['Age', 'Education', 'Usage', 'Income','Miles']
plt.figure(figsize=(15, 12))
for i, variable in enumerate(numeric_columns):
plt.subplot(4, 4, i + 1)
plt.boxplot(df[variable], whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
sns.catplot(x='Product', y='Income', data=df, kind="bar", hue='Gender')
plt.xticks(rotation=90)
plt.show()
sns.catplot(x='Product', y='Education', data=df, kind="bar", hue='Gender')
plt.xticks(rotation=90)
plt.show()
df.describe(include='all').T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Product | 180 | 3 | TM195 | 80 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Age | 180.0 | NaN | NaN | NaN | 28.747222 | 6.828608 | 18.0 | 24.0 | 26.0 | 33.0 | 46.5 |
| Gender | 180 | 2 | Male | 104 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Education | 180.0 | NaN | NaN | NaN | 15.533333 | 1.507523 | 12.0 | 14.0 | 16.0 | 16.0 | 19.0 |
| MaritalStatus | 180 | 2 | Partnered | 107 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Usage | 180.0 | NaN | NaN | NaN | 3.419444 | 0.996031 | 2.0 | 3.0 | 3.0 | 4.0 | 5.5 |
| Fitness | 180.0 | NaN | NaN | NaN | 3.311111 | 0.958869 | 1.0 | 3.0 | 3.0 | 4.0 | 5.0 |
| Income | 180.0 | NaN | NaN | NaN | 52440.242361 | 13488.830441 | 29562.0 | 44058.75 | 50596.5 | 58668.0 | 80581.875 |
| Miles | 180.0 | NaN | NaN | NaN | 99.874306 | 42.163049 | 21.0 | 66.0 | 94.0 | 114.75 | 187.875 |
We analyzed a dataset of nearly 180 treadmill product informations. The main feature of interest here is the age of customers, usage of the products, and the customers' characteristics in buying the products. From a business perspective, having a certain treadmill is dependent upon the price and product features. Thus we determined the factors that compose of the customers' profile in buying the treadmill product.
We have been able to conclude that -